Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.

Among renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. Its premise is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance are much lower.

The sensors fitted across the different machines involved in energy generation collect data on environmental factors (temperature, humidity, wind speed, etc.) and on various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective¶

“ReneWind” is a company working to improve the machinery and processes involved in wind energy production using machine learning, and it has collected sensor data on generator failures of wind turbines. Because the data collected through sensors is confidential (the type of data collected varies by company), they have shared a ciphered version. The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.

The objective is to build various classification models, tune them, and find the best one to identify failures so that generators can be repaired before they break, reducing the overall maintenance cost. The model's predictions translate into costs as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
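This cost ordering can be made concrete with a simple cost function over the confusion-matrix counts. The unit costs below are hypothetical placeholders chosen only to respect the stated ordering (inspection < repair < replacement), since the actual costs are not provided:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical unit costs, chosen only to respect the stated ordering:
# inspection (FP) < repair (TP) < replacement (FN)
COST_INSPECTION = 1
COST_REPAIR = 5
COST_REPLACEMENT = 25

def maintenance_cost(y_true, y_pred):
    """Total maintenance cost implied by the model's predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * COST_INSPECTION + tp * COST_REPAIR + fn * COST_REPLACEMENT

# Three real failures: two caught (repairs), one missed (replacement),
# plus one false alarm (inspection)
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1])
print(maintenance_cost(y_true, y_pred))  # 2*5 + 1*25 + 1*1 = 36
```

A cost function like this makes it clear why recall matters most here: each false negative costs far more than a false positive.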

A “1” in the target variable should be read as “failure”, and “0” as “no failure”.

Data Description¶

The data provided is a transformed version of the original data which was collected using sensors.

  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.

Both the datasets consist of 40 predictor variables and 1 target variable.

Installing and Importing the necessary libraries¶

In [1]:
# Installing the libraries with the specified version
# !pip install --no-deps tensorflow==2.18.0 scikit-learn==1.3.2 matplotlib==3.8.3 seaborn==0.13.2 numpy==1.26.4 pandas==2.2.2 -q --user --no-warn-script-location

!pip install tensorflow scikit-learn matplotlib seaborn numpy pandas --user

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
import time

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()  # Apply the default seaborn theme (preferred over the older sns.set alias)

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To split the data into train and validation sets
from sklearn.model_selection import train_test_split

# Tools for data preprocessing: label encoding, one-hot encoding, and feature scaling
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
#Imports a class for imputing missing values in datasets.
from sklearn.impute import SimpleImputer

import tensorflow as tf #An end-to-end open source machine learning platform
from tensorflow import keras  # High-level neural networks API for deep learning.
from keras import backend   # Abstraction layer for neural network backend engines.
from keras.models import Sequential  # Model for building NN sequentially.
from keras.layers import (
    Dense,
    Dropout,
    Activation,
    BatchNormalization  # Layers for building NN.
)

# Libraries to get different metric scores
from sklearn import metrics
from sklearn.utils import class_weight

from sklearn.metrics import (
    confusion_matrix,
    ConfusionMatrixDisplay,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

Global Data Values¶

In [2]:
# Random state for reproducibility
RS = 42

# Validation size for train-validation split
VS = 0.25

# Scoring function for model evaluation. Recall is chosen because a false negative (a missed failure) leads to a replacement, the most expensive outcome.
SCORER = metrics.make_scorer(metrics.recall_score)

# Set class weights as imbalanced data is used
CLASS_WEIGHTS = {0:1.0, 1: 10.0}

EPOCHS = 30
BATCH_SIZE = 16
LEARNING_RATE=0.001
THRESHOLD = 0.5
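
As a toy illustration of how these constants interact: THRESHOLD converts predicted probabilities into binary labels, and recall (the metric wrapped by SCORER) measures the fraction of real failures detected. The probabilities below are made up:

```python
import numpy as np
from sklearn import metrics

THRESHOLD = 0.5
SCORER = metrics.make_scorer(metrics.recall_score)  # for use with sklearn search utilities

# Made-up probabilities for three real failures (1) and two non-failures (0)
y_true = np.array([1, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.6, 0.2, 0.7, 0.1])

# Probabilities above the threshold become positive predictions
y_pred = (y_prob > THRESHOLD).astype(int)

# Recall = TP / (TP + FN): two of the three real failures are detected
print(metrics.recall_score(y_true, y_pred))  # 2/3 ≈ 0.667
```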

Common Methods¶

  • To save time and reduce code duplication, I'm predefining methods that will be used frequently throughout the EDA and data modeling steps. The methods cover:
    • Confusion Matrix Rendering
    • Correlation Matrix Rendering
    • Box Plot and Histogram Rendering
    • Scatter Plot and Count Plot Rendering
In [3]:
# Function to plot the confusion matrix
# Parameters:
# model: The trained model to evaluate.
# x: Features used for prediction.
# y: True labels for the features.
# title: Optional title for the confusion matrix plot.
def draw_confusion_matrix(model, x, y, title=None):
    y_pred = model.predict(x) > THRESHOLD  # Predict probabilities and convert to binary predictions based on the threshold

    cm = confusion_matrix(y, y_pred, labels=[0, 1])

    # Create labels for each cell in the confusion matrix with both count and percentage
    labels = np.asarray(
        [
            f"{item}\n{item / cm.sum():.2%}"
            for item in cm.flatten()
        ]
    ).reshape(cm.shape)

    # Create the confusion matrix display and turn off the grid
    # and set the display labels to 'No' and 'Yes'
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No', 'Yes'])
    disp.plot(include_values=False)  # Prevent default annotation

    ax = disp.ax_

    # Set the title of the confusion matrix plot if provided
    if title is not None:
        ax.set_title(title)

    for (i, j), label in np.ndenumerate(labels):
        ax.text(j, i, label, ha='center', va='center', color='black', fontsize=12)
    plt.grid(False)  # Turn off the grid

def highlight_strong_correlations(val):
    # Flag strong correlations (|corr| >= 0.5), excluding the self-correlation of 1.
    # An explicit constant is used here so the correlation cutoff is not coupled
    # to the classification THRESHOLD defined above.
    color = ''
    if 0.5 <= abs(val) < 1:
        color = 'background-color: green'

    return color

def draw_default_correlation_matrix(data):

    """
    Plot the correlation matrix for the given DataFrame.

    Parameters:
    data (DataFrame): The DataFrame containing the data.
    """

    cols_list = data.select_dtypes(include=np.number).columns.tolist()

    # Calculate the correlation matrix
    correlation_matrix = data[cols_list].corr()

    styled_corr = correlation_matrix.style.map(highlight_strong_correlations)  # Styler.map replaces the deprecated applymap
    display(styled_corr)

    # Plot the correlation matrix using a heatmap
    plt.figure(figsize=(64, 32))
    sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, square=True, cbar_kws={"shrink": .8})
    plt.title('Correlation Matrix')
    plt.show()

#Method to draw box plot and histogram for univariate analysis
# Parameters:
# data: The DataFrame containing the data.
# column_name: The name of the column for which the box plot and histogram will be drawn.
def draw_boxplot_and_histogram(data, column_name):
    plt.figure(figsize=(18, 6))
    
    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data[column_name], bins=30, kde=True)
    plt.title(f'{column_name} Distribution')
    
    # Box Plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[column_name])
    plt.title(f'{column_name} Box Plot')
    
    plt.show()

Note:

  • After running the installation cell above, kindly restart the runtime (for Google Colab) or the notebook kernel (for Jupyter Notebook), then run all cells sequentially from the next cell.
  • The installation may print a warning about package dependencies. It can be ignored, as the installed libraries and their dependencies are sufficient to execute the code in this notebook.

Loading the Data¶

In [4]:
train_data = pd.read_csv("Train.csv")
test_data = pd.read_csv("Test.csv")

#Make a copy of the train and test data
train_data_copy = train_data.copy()
test_data_copy = test_data.copy()

Data Overview¶

In [5]:
# Check that the percentage of the output variable is the same in both train and test datasets
train_percentage = train_data['Target'].value_counts(normalize=True)
test_percentage = test_data['Target'].value_counts(normalize=True)

print("Train Data Output Percentage:\n\n", train_percentage)
print("Test Data Output Percentage:\n", test_percentage)
Train Data Output Percentage:

 Target
0    0.9445
1    0.0555
Name: proportion, dtype: float64
Test Data Output Percentage:
 Target
0    0.9436
1    0.0564
Name: proportion, dtype: float64
  • As a quick sanity check, the Target variable splits into approximately 95% “no failure” and 5% “failure” in both the training and test sets, which is good: the test data is representative of the population class ratio.
  • The challenge with this dataset is that it is heavily imbalanced, meaning there is relatively little failure data in either split. We will use the class_weight option to give more weight to the minority class value of 1, which indicates a wind turbine failure.
  • The target is a binary classification: 0 for no failure, 1 for failure. A sigmoid activation is therefore a natural choice for the output layer.
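Instead of hand-picking weights such as {0: 1.0, 1: 10.0}, balanced weights can also be derived directly from the class frequencies with scikit-learn. This sketch uses synthetic labels mirroring the ~5.55% failure rate reported above:

```python
import numpy as np
from sklearn.utils import class_weight

# Synthetic labels mirroring the ~5.55% failure rate observed in Train.csv
y_train = np.array([1] * 555 + [0] * 9445)

# "balanced" sets each weight to n_samples / (n_classes * class_count)
weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
print(dict(zip([0, 1], weights)))  # minority class 1 gets a weight of ~9
```

A dictionary built this way can be passed straight to the `class_weight` argument of a Keras `model.fit` call.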
In [6]:
train_data_copy.head(10)  # Display the first 10 rows of the training data copy
Out[6]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.464606 -4.679129 3.101546 0.506130 -0.221083 -2.032511 -2.910870 0.050714 -1.522351 3.761892 -5.714719 0.735893 0.981251 1.417884 -3.375815 -3.047303 0.306194 2.914097 2.269979 4.394876 -2.388299 0.646388 -1.190508 3.132986 0.665277 -2.510846 -0.036744 0.726218 -3.982187 -1.072638 1.667098 3.059700 -1.690440 2.846296 2.235198 6.667486 0.443809 -2.369169 2.950578 -3.480324 0
1 3.365912 3.653381 0.909671 -1.367528 0.332016 2.358938 0.732600 -4.332135 0.565695 -0.101080 1.914465 -0.951458 -1.255259 -2.706522 0.193223 -4.769379 -2.205319 0.907716 0.756894 -5.833678 -3.065122 1.596647 -1.757311 1.766444 -0.267098 3.625036 1.500346 -0.585712 0.783034 -0.201217 0.024883 -1.795474 3.032780 -2.467514 1.894599 -2.297780 -1.731048 5.908837 -0.386345 0.616242 0
2 -3.831843 -5.824444 0.634031 -2.418815 -1.773827 1.016824 -2.098941 -3.173204 -2.081860 5.392621 -0.770673 1.106718 1.144261 0.943301 -3.163804 -4.247825 -4.038909 3.688534 3.311196 1.059002 -2.143026 1.650120 -1.660592 1.679910 -0.450782 -4.550695 3.738779 1.134404 -2.033531 0.840839 -1.600395 -0.257101 0.803550 4.086219 2.292138 5.360850 0.351993 2.940021 3.839160 -4.309402 0
3 1.618098 1.888342 7.046143 -1.147285 0.083080 -1.529780 0.207309 -2.493629 0.344926 2.118578 -3.053023 0.459719 2.704527 -0.636086 -0.453717 -3.174046 -3.404347 -1.281536 1.582104 -1.951778 -3.516555 -1.206011 -5.627854 -1.817653 2.124142 5.294642 4.748137 -2.308536 -3.962977 -6.028730 4.948770 -3.584425 -2.577474 1.363769 0.622714 5.550100 -1.526796 0.138853 3.101430 -1.277378 0
4 -0.111440 3.872488 -3.758361 -2.982897 3.792714 0.544960 0.205433 4.848994 -1.854920 -6.220023 1.998347 4.723757 0.709113 -1.989432 -2.632684 4.184447 2.245356 3.734452 -6.312766 -5.379918 -0.886667 2.061694 9.445586 4.489976 -3.945144 4.582065 -8.780422 -3.382967 5.106507 6.787513 2.044184 8.265896 6.629213 -10.068689 1.222987 -3.229763 1.686909 -2.163896 -3.644622 6.510338 0
5 0.159623 -4.233781 -0.264310 -5.477119 -0.190854 -0.356274 -0.134486 4.066608 -3.858569 1.692441 0.137901 3.974719 0.672853 1.878144 0.764158 4.235913 -2.129272 2.348465 -2.147454 -0.982376 0.386345 1.010637 3.418654 0.996017 0.060580 -3.036740 1.787573 -1.726537 0.307837 1.902350 4.665858 3.227235 0.628900 -1.548860 1.321979 5.461345 1.109410 -3.869993 0.273964 2.805941 0
6 -0.184565 -4.721470 0.864988 -3.078695 -2.226888 -1.282220 -0.804717 3.289733 -1.567971 0.749904 0.528830 3.220564 2.945183 1.724073 -0.923123 2.534830 -1.696713 0.677068 -0.246087 2.747678 -1.165392 0.247621 1.160684 -2.850139 0.503405 -3.532215 1.861243 -1.465354 0.873767 2.418470 0.939376 -0.544941 -0.762921 0.815558 1.889373 3.624347 1.555740 -5.432884 0.678703 0.464697 0
7 1.734840 1.682945 -1.269070 4.600630 -1.416975 -2.543916 0.131648 -0.198661 3.094057 -1.109324 -1.662364 0.943806 3.481045 0.137055 -3.472977 -4.075917 1.726571 -1.908618 3.569249 2.512191 -4.578679 3.062674 3.686149 0.610743 -0.429539 0.880126 -0.993851 1.134221 -3.767917 -0.692236 -5.244396 1.717474 -3.838931 1.569448 1.794899 -4.268517 -0.516195 -0.619218 -0.830889 -4.967266 1
8 1.781583 1.314664 4.248690 -0.518293 -0.149044 0.033082 -1.087893 -3.117561 0.624935 1.567455 -0.415122 -1.400792 2.607063 -1.023519 -2.877902 -4.524080 -4.353952 0.106859 1.298601 -3.595654 -5.409204 0.633421 -3.043436 0.965268 -0.266332 4.670862 1.846717 -2.320822 -1.317705 -0.681722 3.280787 1.611014 2.951390 -1.862016 4.389598 1.371300 -2.516235 0.770496 0.831132 -2.310953 0
9 -0.894140 4.011498 5.251902 3.320747 0.727067 -4.771070 1.031232 3.632080 -1.391444 -1.966746 -4.779273 6.616781 -0.147815 -2.513234 0.734111 0.474710 5.085254 -2.360998 4.561398 2.287065 -2.307024 -0.948690 -0.300906 2.546197 0.738320 4.266330 -4.144926 -0.012559 -1.469495 -2.003484 1.680064 -0.635742 -4.449139 2.296340 1.575110 1.376268 0.596757 -1.413652 0.543871 0.035020 0
In [7]:
test_data_copy.head(10)
Out[7]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -0.613489 -3.819640 2.202302 1.300420 -1.184929 -4.495964 -1.835817 4.722989 1.206140 -0.341909 -5.122874 1.017021 4.818549 3.269001 -2.984330 1.387370 2.032002 -0.511587 -1.023069 7.338733 -2.242244 0.155489 2.053786 -2.772273 1.851369 -1.788696 -0.277282 -1.255143 -3.832886 -1.504542 1.586765 2.291204 -5.411388 0.870073 0.574479 4.157191 1.428093 -10.511342 0.454664 -1.448363 0
1 0.389608 -0.512341 0.527053 -2.576776 -1.016766 2.235112 -0.441301 -4.405744 -0.332869 1.966794 1.796544 0.410490 0.638328 -1.389600 -1.883410 -5.017922 -3.827238 2.418060 1.762285 -3.242297 -3.192960 1.857454 -1.707954 0.633444 -0.587898 0.083683 3.013935 -0.182309 0.223917 0.865228 -1.782158 -2.474936 2.493582 0.315165 2.059288 0.683859 -0.485452 5.128350 1.720744 -1.488235 0
2 -0.874861 -0.640632 4.084202 -1.590454 0.525855 -1.957592 -0.695367 1.347309 -1.732348 0.466500 -4.928214 3.565070 -0.449329 -0.656246 -0.166537 -1.630207 2.291865 2.396492 0.601278 1.793534 -2.120238 0.481968 -0.840707 1.790197 1.874395 0.363930 -0.169063 -0.483832 -2.118982 -2.156586 2.907291 -1.318888 -2.997464 0.459664 0.619774 5.631504 1.323512 -1.752154 1.808302 1.675748 0
3 0.238384 1.458607 4.014528 2.534478 1.196987 -3.117330 -0.924035 0.269493 1.322436 0.702345 -5.578345 -0.850662 2.590525 0.767418 -2.390809 -2.341961 0.571875 -0.933751 0.508677 1.210715 -3.259524 0.104587 -0.658875 1.498107 1.100305 4.142988 -0.248446 -1.136516 -5.355810 -4.545931 3.808667 3.517918 -3.074085 -0.284220 0.954576 3.029331 -1.367198 -3.412140 0.906000 -2.450889 0
4 5.828225 2.768260 -1.234530 2.809264 -1.641648 -1.406698 0.568643 0.965043 1.918379 -2.774855 -0.530016 1.374544 -0.650941 -1.679466 -0.379220 -4.443143 3.893857 -0.607640 2.944931 0.367233 -5.789081 4.597528 4.450264 3.224941 0.396701 0.247765 -2.362047 1.079378 -0.473076 2.242810 -3.591421 1.773841 -1.501573 -2.226702 4.776830 -6.559698 -0.805551 -0.276007 -3.858207 -0.537694 0
5 -1.885713 -1.964160 0.245667 -1.187255 0.027369 -2.214094 -0.605558 3.434368 -2.366542 0.238592 -2.421572 5.443762 1.621775 0.403306 -2.084618 0.838689 1.168480 2.018892 0.968280 1.563233 -2.037208 1.807605 4.216895 2.806480 -0.692462 -1.224669 -1.904662 -0.416139 -1.203544 1.657175 0.658367 3.481445 -1.241937 0.165481 1.938014 3.174898 1.513700 -2.634459 0.694483 -0.169500 0
6 -1.836429 1.216661 -0.186460 0.232731 1.752135 -1.982141 0.637039 3.654029 -2.891643 -0.882726 -2.881859 5.532344 -1.843551 -0.994694 0.602109 1.870065 3.930774 1.278002 1.110149 0.088251 0.226533 1.064542 4.210596 5.268233 -0.754587 0.433090 -4.173879 0.675818 -0.654066 0.612422 1.253968 3.697508 -1.371313 -0.267922 0.385374 1.392039 1.195155 0.104975 -0.258228 1.581771 0
7 -1.649117 0.646787 2.657947 1.395099 0.725959 0.305211 -1.877257 -3.814487 2.273639 0.434063 -2.533155 -3.581302 1.480436 -0.453753 -3.334392 -4.882117 -0.657244 1.311949 -0.487993 0.744801 -2.215417 -0.555530 -3.758861 -0.623616 0.436324 2.663326 0.024642 -0.514574 -1.836844 -2.277864 -0.105820 -1.082314 0.530939 -0.290736 -0.219059 1.364707 -0.565783 0.605945 1.772588 -1.977966 0
8 -2.744431 -5.870927 1.169155 -1.586454 -2.215360 -3.561773 -2.037385 2.782849 -0.687223 1.527678 -4.668574 5.722978 5.746367 2.352832 -5.336724 -2.039904 0.340710 3.045392 1.603427 6.519083 -4.491886 2.848353 3.717535 -1.080943 0.840454 -4.229435 1.391155 -0.403130 -4.380458 -0.042646 -2.192626 0.172074 -5.489960 3.224386 1.433453 6.421956 3.016854 -5.953351 3.084918 -2.982987 0
9 -0.247320 -1.130009 4.584899 0.051528 0.044828 -2.527062 -1.643095 1.042020 -0.059002 0.751700 -4.915543 0.709726 1.810841 0.466331 -2.012662 -2.139615 0.767581 1.047640 0.299762 2.506561 -3.523522 0.229476 -1.363760 0.667588 1.640188 1.301918 0.040996 -1.302753 -2.962745 -2.077191 3.567099 1.133185 -2.171650 -0.245509 2.071918 4.719610 0.033392 -4.396785 1.221412 -0.531737 0
In [8]:
#Check the initial shape of the train and test data
print("Train Data Shape:", train_data_copy.shape)
print("Test Data Shape:", test_data_copy.shape)
Train Data Shape: (20000, 41)
Test Data Shape: (5000, 41)
In [9]:
#Print the info of the train and test data
print("Train Data Info:")
train_data_copy.info()

print("\nTest Data Info:")
test_data_copy.info()
Train Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

Test Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      4995 non-null   float64
 1   V2      4994 non-null   float64
 2   V3      5000 non-null   float64
 3   V4      5000 non-null   float64
 4   V5      5000 non-null   float64
 5   V6      5000 non-null   float64
 6   V7      5000 non-null   float64
 7   V8      5000 non-null   float64
 8   V9      5000 non-null   float64
 9   V10     5000 non-null   float64
 10  V11     5000 non-null   float64
 11  V12     5000 non-null   float64
 12  V13     5000 non-null   float64
 13  V14     5000 non-null   float64
 14  V15     5000 non-null   float64
 15  V16     5000 non-null   float64
 16  V17     5000 non-null   float64
 17  V18     5000 non-null   float64
 18  V19     5000 non-null   float64
 19  V20     5000 non-null   float64
 20  V21     5000 non-null   float64
 21  V22     5000 non-null   float64
 22  V23     5000 non-null   float64
 23  V24     5000 non-null   float64
 24  V25     5000 non-null   float64
 25  V26     5000 non-null   float64
 26  V27     5000 non-null   float64
 27  V28     5000 non-null   float64
 28  V29     5000 non-null   float64
 29  V30     5000 non-null   float64
 30  V31     5000 non-null   float64
 31  V32     5000 non-null   float64
 32  V33     5000 non-null   float64
 33  V34     5000 non-null   float64
 34  V35     5000 non-null   float64
 35  V36     5000 non-null   float64
 36  V37     5000 non-null   float64
 37  V38     5000 non-null   float64
 38  V39     5000 non-null   float64
 39  V40     5000 non-null   float64
 40  Target  5000 non-null   int64  
dtypes: float64(40), int64(1)
memory usage: 1.6 MB
  • The 40 predictors V1–V40 are all floating-point values.
  • The target variable Target is an integer.
In [10]:
#Check for missing values in the train and test data
print("\nMissing Values in Train Data:\n", train_data_copy.isnull().sum())
print("\nMissing Values in Test Data:\n", test_data_copy.isnull().sum())

#Print the total number of missing values in the train and test data
print("\nTotal Missing Values in Train Data:", train_data_copy.isnull().sum().sum())
print("Total Missing Values in Test Data:", test_data_copy.isnull().sum().sum())
Missing Values in Train Data:
 V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64

Missing Values in Test Data:
 V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64

Total Missing Values in Train Data: 36
Total Missing Values in Test Data: 11
  • There are a total of 36 missing values in the training data set.
  • There are a total of 11 missing values in the test data set.
  • To avoid data leakage, we will split the training set into training and validation sets and fit any data treatment (such as imputation) on the training set only, then apply the fitted treatment to the validation and test sets.
  • After splitting the training data into training and validation sets, we confirm that the proportion of 0 to 1 is roughly 95% to 5% in each.
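As a minimal sketch of the leakage-safe imputation pattern (toy values, not the real sensor data): the statistic is computed from the training rows alone and then reused on every other split.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real splits (V1 is one of the ciphered sensor columns).
train = pd.DataFrame({"V1": [1.0, np.nan, 3.0, 5.0]})
val = pd.DataFrame({"V1": [np.nan, 2.0]})

# Fit on the training rows only: the median never sees validation/test values.
train_median = train["V1"].median()

# Reuse the same training statistic on every split.
train_filled = train.fillna(train_median)
val_filled = val.fillna(train_median)
```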
In [11]:
#Check for duplicate rows in the training data
duplicates_train = train_data_copy.duplicated().sum()
print("\nDuplicate Rows in Train Data:", duplicates_train)
Duplicate Rows in Train Data: 0
In [12]:
# Look at the description of the training data
train_data_copy.describe()
Out[12]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
count 19982.000000 19982.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000
mean -0.271996 0.440430 2.484699 -0.083152 -0.053752 -0.995443 -0.879325 -0.548195 -0.016808 -0.012998 -1.895393 1.604825 1.580486 -0.950632 -2.414993 -2.925225 -0.134261 1.189347 1.181808 0.023608 -3.611252 0.951835 -0.366116 1.134389 -0.002186 1.873785 -0.612413 -0.883218 -0.985625 -0.015534 0.486842 0.303799 0.049825 -0.462702 2.229620 1.514809 0.011316 -0.344025 0.890653 -0.875630 0.055500
std 3.441625 3.150784 3.388963 3.431595 2.104801 2.040970 1.761626 3.295756 2.160568 2.193201 3.124322 2.930454 2.874658 1.789651 3.354974 4.221717 3.345462 2.592276 3.396925 3.669477 3.567690 1.651547 4.031860 3.912069 2.016740 3.435137 4.368847 1.917713 2.684365 3.005258 3.461384 5.500400 3.575285 3.183841 2.937102 3.800860 1.788165 3.948147 1.753054 3.012155 0.228959
min -11.876451 -12.319951 -10.708139 -15.082052 -8.603361 -10.227147 -7.949681 -15.657561 -8.596313 -9.853957 -14.832058 -12.948007 -13.228247 -7.738593 -16.416606 -20.374158 -14.091184 -11.643994 -13.491784 -13.922659 -17.956231 -10.122095 -14.866128 -16.387147 -8.228266 -11.834271 -14.904939 -9.269489 -12.579469 -14.796047 -13.722760 -19.876502 -16.898353 -17.985094 -15.349803 -14.833178 -5.478350 -17.375002 -6.438880 -11.023935 0.000000
25% -2.737146 -1.640674 0.206860 -2.347660 -1.535607 -2.347238 -2.030926 -2.642665 -1.494973 -1.411212 -3.922404 -0.396514 -0.223545 -2.170741 -4.415322 -5.634240 -2.215611 -0.403917 -1.050168 -2.432953 -5.930360 -0.118127 -3.098756 -1.468062 -1.365178 -0.337863 -3.652323 -2.171218 -2.787443 -1.867114 -1.817772 -3.420469 -2.242857 -2.136984 0.336191 -0.943809 -1.255819 -2.987638 -0.272250 -2.940193 0.000000
50% -0.747917 0.471536 2.255786 -0.135241 -0.101952 -1.000515 -0.917179 -0.389085 -0.067597 0.100973 -1.921237 1.507841 1.637185 -0.957163 -2.382617 -2.682705 -0.014580 0.883398 1.279061 0.033415 -3.532888 0.974687 -0.262093 0.969048 0.025050 1.950531 -0.884894 -0.891073 -1.176181 0.184346 0.490304 0.052073 -0.066249 -0.255008 2.098633 1.566526 -0.128435 -0.316849 0.919261 -0.920806 0.000000
75% 1.840112 2.543967 4.566165 2.130615 1.340480 0.380330 0.223695 1.722965 1.409203 1.477045 0.118906 3.571454 3.459886 0.270677 -0.359052 -0.095046 2.068751 2.571770 3.493299 2.512372 -1.265884 2.025594 2.451750 3.545975 1.397112 4.130037 2.189177 0.375884 0.629773 2.036229 2.730688 3.761722 2.255134 1.436935 4.064358 3.983939 1.175533 2.279399 2.057540 1.119897 0.000000
max 15.493002 13.089269 17.090919 13.236381 8.133797 6.975847 8.006091 11.679495 8.137580 8.108472 11.826433 15.080698 15.419616 5.670664 12.246455 13.583212 16.756432 13.179863 13.237742 16.052339 13.840473 7.409856 14.458734 17.163291 8.223389 16.836410 17.560404 6.527643 10.722055 12.505812 17.255090 23.633187 16.692486 14.358213 15.291065 19.329576 7.467006 15.289923 7.759877 10.654265 1.000000
  • Since the input data are ciphered and we have no information about the parameters, little can be read into the summary statistics beyond the fact that V1 and V2 contain missing values that need to be imputed. For example, there isn't enough information to know whether negative values are plausible for any of the 40 measurements.

Exploratory Data Analysis¶

Univariate analysis¶

In [13]:
#Get the list of all columns in train_data_copy
all_columns = train_data_copy.columns.tolist()

#Since there are a total of 40 input variables, and 1 output variable we will perform a loop of the histogram plot and box plot for each of the input variables
for column in all_columns:
    draw_boxplot_and_histogram(train_data_copy, column)
    plt.show()
[41 figures omitted: histogram and box plot for each of V1–V40 and Target]
  • The input variables V1-V40 are all roughly normally distributed.
  • The input variables V1-V40 all show outliers in their box plots.
  • The Target variable is binary and heavily imbalanced (~95% class 0), so it naturally does not follow a normal distribution.
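The outlier flags in the box plots follow the standard 1.5×IQR rule; a minimal sketch on hypothetical values (the real notebook applies this visually across V1–V40):

```python
import pandas as pd

# Hypothetical column values standing in for one of the V1-V40 sensor columns.
s = pd.Series([-2.7, -0.7, 0.5, 1.8, 15.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# A point is flagged as an outlier if it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```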

Bivariate Analysis¶

Correlation Heatmap¶

In [14]:
draw_default_correlation_matrix(train_data_copy)

#Identify all the variables that have strong correlations with each other
# Note: The threshold for strong correlation is set to 0.5
strong_correlations = train_data_copy.corr().abs() > THRESHOLD
strong_correlations = strong_correlations.where(np.triu(np.ones(strong_correlations.shape), k=1).astype(bool))
strong_correlations = strong_correlations.stack().reset_index()
strong_correlations.columns = ['Variable1', 'Variable2', 'Correlation']
strong_correlations = strong_correlations[strong_correlations['Correlation'] == True]

print("Strong Correlations:\n", strong_correlations)
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
V1 1.000000 0.313593 0.388855 -0.294832 -0.516085 0.175518 0.480694 -0.361016 0.398778 -0.022043 0.291310 -0.144904 0.056089 -0.273683 0.413259 -0.334787 -0.347788 -0.390056 0.127776 -0.341331 -0.392011 0.207272 -0.436782 -0.512832 0.675602 0.222721 0.683816 -0.280558 -0.062311 -0.464534 0.084862 -0.633279 -0.289917 -0.019433 0.142695 -0.124949 -0.350610 0.148316 -0.120898 0.212632 0.073307
V2 0.313593 1.000000 0.095185 0.290202 0.383785 0.233949 0.455632 -0.383237 0.280601 -0.483879 0.158944 -0.159064 -0.381918 -0.853530 0.221681 -0.241576 0.164679 -0.303630 0.119098 -0.589420 -0.064819 -0.096020 -0.181389 0.221934 -0.127100 0.787440 -0.204437 0.032709 0.179821 -0.216071 -0.042449 -0.115820 0.203387 -0.281097 -0.054777 -0.580675 -0.437349 0.655368 -0.350539 0.155617 -0.000946
V3 0.388855 0.095185 1.000000 -0.028828 -0.359628 -0.291644 -0.156267 -0.412009 0.233626 0.446161 -0.334642 -0.166270 0.329552 -0.222967 -0.061598 -0.533497 -0.413890 -0.267845 0.402367 -0.039521 -0.658327 -0.194010 -0.785580 -0.265330 0.595676 0.459761 0.499957 -0.411772 -0.445440 -0.607322 0.463729 -0.367431 -0.219509 0.225753 0.501409 0.438341 -0.502482 -0.073865 0.527742 -0.306190 -0.213855
V4 -0.294832 0.290202 -0.028828 1.000000 0.084185 -0.470199 -0.196909 0.034906 0.265154 -0.107058 -0.363612 -0.235211 -0.272949 -0.221931 -0.150716 -0.194471 0.606701 -0.413616 0.596391 0.412007 -0.085815 -0.033303 0.036908 0.516096 -0.269900 0.106663 -0.588490 0.660283 -0.186060 0.121724 -0.368177 0.383456 -0.052216 0.297496 0.340764 -0.557958 -0.356650 0.090986 -0.389080 -0.665310 0.110786
V5 -0.516085 0.383785 -0.359628 0.084185 1.000000 0.156161 -0.078436 0.168267 -0.297635 -0.343741 -0.212215 -0.018023 -0.333497 -0.146212 -0.146589 0.266994 0.328192 0.432620 -0.504478 -0.360510 0.383959 -0.089915 0.456634 0.662638 -0.602529 0.405462 -0.662801 -0.034228 0.093092 0.141055 0.301930 0.619779 0.458888 -0.607112 -0.341275 -0.045510 0.064515 0.171836 -0.217778 0.335332 -0.100525
V6 0.175518 0.233949 -0.291644 -0.470199 0.156161 1.000000 0.210914 -0.559084 0.084554 -0.116887 0.710480 -0.395911 -0.229053 -0.346696 0.145335 -0.084184 -0.454301 0.286163 -0.418803 -0.695236 0.223402 -0.068138 -0.186560 -0.198847 -0.190472 0.147217 0.217310 -0.182653 0.586885 0.153105 -0.115764 -0.292468 0.587371 -0.401306 -0.317068 -0.247402 -0.067280 0.628722 -0.025458 0.423882 0.000237
V7 0.480694 0.455632 -0.156267 -0.196909 -0.078436 0.210914 1.000000 0.092713 -0.244421 -0.176849 0.530656 0.397331 -0.493828 -0.323227 0.867871 0.401290 0.027866 -0.561630 0.135626 -0.413238 0.470703 -0.277780 -0.050080 -0.210543 0.201945 0.023205 0.289410 0.189329 0.311074 -0.250387 -0.034001 -0.458844 -0.267082 0.232444 -0.438937 -0.280033 -0.007940 0.469057 -0.236700 0.479354 0.236907
V8 -0.361016 -0.383237 -0.412009 0.034906 0.168267 -0.559084 0.092713 1.000000 -0.611735 -0.179856 -0.193942 0.674030 -0.104306 0.545237 0.176102 0.802505 0.514604 -0.025594 -0.151573 0.440875 0.484455 -0.049784 0.717858 0.250453 -0.202921 -0.502977 -0.418680 0.172116 0.058335 0.360815 0.149730 0.471798 -0.251002 0.054575 -0.167235 0.155913 0.522797 -0.614964 -0.344376 0.256984 0.135996
V9 0.398778 0.280601 0.233626 0.265154 -0.297635 0.084554 -0.244421 -0.611735 1.000000 -0.293048 -0.090165 -0.629641 0.391041 -0.238225 -0.393782 -0.752931 -0.040711 -0.045757 0.041111 0.126822 -0.596978 0.318376 -0.344270 -0.390001 0.314500 0.316018 0.173953 -0.099764 -0.235584 -0.282570 -0.458116 -0.369241 -0.137518 -0.102147 0.117301 -0.351657 -0.200198 0.038099 0.001001 -0.308607 0.008124
V10 -0.022043 -0.483879 0.446161 -0.107058 -0.343741 -0.116887 -0.176849 -0.179856 -0.293048 1.000000 -0.156346 -0.119735 0.068491 0.343507 0.080268 -0.117692 -0.509791 -0.144308 0.414314 0.056561 -0.125682 -0.174525 -0.458001 -0.001608 0.276199 -0.223186 0.502179 0.033785 -0.434624 -0.314250 0.403991 -0.018839 -0.110081 0.514227 0.346129 0.561161 -0.403067 -0.007520 0.560471 -0.474803 -0.051263
V11 0.291310 0.158944 -0.334642 -0.363612 -0.212215 0.710480 0.530656 -0.193942 -0.090165 -0.156346 1.000000 -0.004893 -0.178650 -0.275305 0.411599 0.276387 -0.437216 -0.244070 -0.093482 -0.535816 0.336133 -0.261462 -0.111463 -0.377024 -0.217322 -0.057829 0.252830 -0.087538 0.811228 0.371119 -0.231534 -0.363715 0.422481 -0.062297 -0.169441 -0.427396 -0.045318 0.471418 -0.239641 0.336767 0.196715
V12 -0.144904 -0.159064 -0.166270 -0.235211 -0.018023 -0.395911 0.397331 0.674030 -0.629641 -0.119735 -0.004893 1.000000 -0.010808 0.130274 0.264978 0.562087 0.243546 -0.047265 0.064590 0.147514 0.281808 0.016347 0.454646 0.036910 -0.052367 -0.297273 -0.083374 0.043959 0.080279 0.137589 0.033878 0.008758 -0.355644 0.237594 -0.194636 0.242611 0.548449 -0.206345 0.053823 0.307470 -0.021807
V13 0.056089 -0.381918 0.329552 -0.272949 -0.333497 -0.229053 -0.493828 -0.104306 0.391041 0.068491 -0.178650 -0.010808 1.000000 0.367444 -0.684788 -0.314649 -0.458437 0.255678 -0.198617 0.226223 -0.607169 0.277293 -0.055490 -0.492305 0.160221 0.084519 0.301270 -0.657255 -0.339866 -0.079904 0.024917 -0.102412 -0.111937 -0.112675 0.224130 0.422558 0.168500 -0.559274 0.486516 -0.258869 -0.139718
V14 -0.273683 -0.853530 -0.222967 -0.221931 -0.146212 -0.346696 -0.323227 0.545237 -0.238225 0.343507 -0.275305 0.130274 0.367444 1.000000 -0.157487 0.404744 -0.030553 0.220037 -0.302476 0.550226 0.208446 0.092585 0.383793 -0.149375 0.115315 -0.674520 0.118412 -0.054306 -0.353208 0.044626 0.172368 0.276829 -0.321509 0.142082 -0.161742 0.547594 0.422413 -0.762684 0.184663 -0.053814 0.117586
V15 0.413259 0.221681 -0.061598 -0.150716 -0.146589 0.145335 0.867871 0.176102 -0.393782 0.080268 0.411599 0.264978 -0.684788 -0.157487 1.000000 0.470695 0.073532 -0.590202 0.213037 -0.245100 0.567302 -0.439281 -0.179441 -0.138497 0.333670 -0.191763 0.314699 0.323758 0.280440 -0.252483 0.138395 -0.394717 -0.287010 0.364587 -0.337384 -0.133556 -0.084773 0.347191 -0.238364 0.438196 0.249118
V16 -0.334787 -0.241576 -0.533497 -0.194471 0.266994 -0.084184 0.401290 0.802505 -0.752931 -0.117692 0.276387 0.562087 -0.314649 0.404744 0.470695 1.000000 0.216777 -0.130780 -0.279313 0.036503 0.836527 -0.437891 0.532444 0.102879 -0.321164 -0.426965 -0.252718 0.122186 0.416053 0.359406 0.210411 0.289538 0.014966 0.040945 -0.447736 0.080209 0.430631 -0.235650 -0.287419 0.477094 0.230507
V17 -0.347788 0.164679 -0.413890 0.606701 0.328192 -0.454301 0.027866 0.514604 -0.040711 -0.509791 -0.437216 0.243546 -0.458437 -0.030553 0.073532 0.216777 1.000000 -0.019526 0.092045 0.511172 0.303739 0.142651 0.535777 0.492579 -0.169761 -0.216592 -0.706598 0.659308 -0.059505 0.171504 -0.363612 0.344394 -0.307117 0.070968 -0.213708 -0.374603 0.348304 -0.131721 -0.511091 0.038264 0.085314
V18 -0.390056 -0.303630 -0.267845 -0.413616 0.432620 0.286163 -0.561630 -0.025594 -0.045757 -0.144308 -0.244070 -0.047265 0.255678 0.220037 -0.590202 -0.130780 -0.019526 1.000000 -0.693428 -0.034312 -0.082599 0.466859 0.396764 0.169876 -0.251902 -0.064678 -0.215565 -0.325600 -0.039814 0.235128 0.048011 0.287090 0.314099 -0.602106 -0.190625 0.356522 0.479581 -0.189250 0.225480 0.286388 -0.293340
V19 0.127776 0.119098 0.402367 0.596391 -0.504478 -0.418803 0.135626 -0.151573 0.041111 0.414314 -0.093482 0.064590 -0.198617 -0.302476 0.213037 -0.279313 0.092045 -0.693428 1.000000 0.246063 -0.267267 -0.111111 -0.428241 0.136032 0.173524 -0.024596 0.095771 0.529055 -0.213289 -0.138990 -0.219658 -0.147915 -0.286116 0.756188 0.553275 -0.240156 -0.505795 0.266142 0.031867 -0.699379 0.053897
V20 -0.341331 -0.589420 -0.039521 0.412007 -0.360510 -0.695236 -0.413238 0.440875 0.126822 0.056561 -0.535816 0.147514 0.226223 0.550226 -0.245100 0.036503 0.511172 -0.034312 0.246063 1.000000 -0.047137 0.116157 0.201459 -0.069796 0.208989 -0.623512 -0.179680 0.413681 -0.376122 0.065311 -0.367301 0.096048 -0.580161 0.503773 0.059147 0.143676 0.426708 -0.646999 0.079338 -0.412649 0.070803
V21 -0.392011 -0.064819 -0.658327 -0.085815 0.383959 0.223402 0.470703 0.484455 -0.596978 -0.125682 0.336133 0.281808 -0.607169 0.208446 0.567302 0.836527 0.303739 -0.082599 -0.267267 -0.047137 1.000000 -0.507400 0.381939 0.143596 -0.321902 -0.410914 -0.258423 0.406565 0.453329 0.229215 -0.021029 0.142997 0.054475 0.111245 -0.700400 -0.102281 0.386868 0.155868 -0.242004 0.470061 0.256411
V22 0.207272 -0.096020 -0.194010 -0.033303 -0.089915 -0.068138 -0.277780 -0.049784 0.318376 -0.174525 -0.261462 0.016347 0.277293 0.092585 -0.439281 -0.437891 0.142651 0.466859 -0.111111 0.116157 -0.507400 1.000000 0.421247 0.158724 0.035946 -0.104899 -0.048791 -0.015530 -0.343337 0.100537 -0.294896 0.155642 -0.122773 -0.312228 0.227197 -0.091865 0.145478 -0.158051 -0.173739 -0.093564 -0.134727
V23 -0.436782 -0.181389 -0.785580 0.036908 0.456634 -0.186560 -0.050080 0.717858 -0.344270 -0.458001 -0.111463 0.454646 -0.055490 0.383793 -0.179441 0.532444 0.535777 0.396764 -0.428241 0.201459 0.381939 0.421247 1.000000 0.442574 -0.546003 -0.345841 -0.628294 0.177058 0.102085 0.541924 -0.166827 0.633804 0.051773 -0.355795 -0.275453 -0.136416 0.571951 -0.341177 -0.478436 0.268313 0.071042
V24 -0.512832 0.221934 -0.265330 0.516096 0.662638 -0.198847 -0.210543 0.250453 -0.390001 -0.001608 -0.377024 0.036910 -0.492305 -0.149375 -0.138497 0.102879 0.492579 0.169876 0.136032 -0.069796 0.143596 0.158724 0.442574 1.000000 -0.613548 0.156184 -0.755335 0.408712 -0.084531 0.321787 0.158342 0.825119 0.359401 -0.220454 0.249089 -0.210090 -0.243950 0.168662 -0.401241 -0.197865 -0.091242
V25 0.675602 -0.127100 0.595676 -0.269900 -0.602529 -0.190472 0.201945 -0.202921 0.314500 0.276199 -0.217322 -0.052367 0.160221 0.115315 0.333670 -0.321164 -0.169761 -0.251902 0.173524 0.208989 -0.321902 0.035946 -0.546003 -0.613548 1.000000 -0.108421 0.766255 -0.138697 -0.469705 -0.764734 0.145528 -0.711082 -0.735157 0.373514 -0.039913 0.392944 -0.029111 -0.194906 0.370801 0.078735 -0.001440
V26 0.222721 0.787440 0.459761 0.106663 0.405462 0.147217 0.023205 -0.502977 0.316018 -0.223186 -0.057829 -0.297273 0.084519 -0.674520 -0.191763 -0.426965 -0.216592 -0.064678 -0.024596 -0.623512 -0.410914 -0.104899 -0.345841 0.156184 -0.108421 1.000000 -0.079879 -0.453507 -0.048861 -0.296300 0.331466 0.011615 0.367059 -0.460704 0.207572 -0.149467 -0.559232 0.376878 0.018593 -0.002513 -0.180469
V27 0.683816 -0.204437 0.499957 -0.588490 -0.662801 0.217310 0.289410 -0.418680 0.173953 0.502179 0.252830 -0.083374 0.301270 0.118412 0.314699 -0.252718 -0.706598 -0.215565 0.095771 -0.179680 -0.258423 -0.048791 -0.628294 -0.755335 0.766255 -0.079879 1.000000 -0.360284 -0.222870 -0.603725 0.185935 -0.765733 -0.380365 0.324211 -0.029566 0.424834 -0.148273 0.051612 0.542822 0.067686 0.014891
V28 -0.280558 0.032709 -0.411772 0.660283 -0.034228 -0.182653 0.189329 0.172116 -0.099764 0.033785 -0.087538 0.043959 -0.657255 -0.054306 0.323758 0.122186 0.659308 -0.325600 0.529055 0.413681 0.406565 -0.015530 0.177058 0.408712 -0.138697 -0.453507 -0.360284 1.000000 -0.009382 0.133319 -0.547992 0.131296 -0.258229 0.561588 -0.104315 -0.480046 0.015925 0.298395 -0.335742 -0.326313 0.207359
V29 -0.062311 0.179821 -0.445440 -0.186060 0.093092 0.586885 0.311074 0.058335 -0.235584 -0.434624 0.811228 0.080279 -0.339866 -0.353208 0.280440 0.416053 -0.059505 -0.039814 -0.213289 -0.376122 0.453329 -0.343337 0.102085 -0.084531 -0.469705 -0.048861 -0.222870 -0.009382 1.000000 0.670054 -0.220486 -0.079502 0.596903 -0.251588 -0.155400 -0.485512 0.147663 0.334013 -0.427830 0.455882 0.108342
V30 -0.464534 -0.216071 -0.607322 0.121724 0.141055 0.153105 -0.250387 0.360815 -0.282570 -0.314250 0.371119 0.137589 -0.079904 0.044626 -0.252483 0.359406 0.171504 0.235128 -0.138990 0.065311 0.229215 0.100537 0.541924 0.321787 -0.764734 -0.296300 -0.603725 0.133319 0.670054 1.000000 -0.305334 0.506868 0.611668 -0.280160 0.207840 -0.405508 0.231476 -0.090686 -0.508283 0.012973 0.038867
V31 0.084862 -0.042449 0.463729 -0.368177 0.301930 -0.115764 -0.034001 0.149730 -0.458116 0.403991 -0.231534 0.033878 0.024917 0.172368 0.138395 0.210411 -0.363612 0.048011 -0.219658 -0.367301 -0.021029 -0.294896 -0.166827 0.158342 0.145528 0.331466 0.185935 -0.547992 -0.220486 -0.305334 1.000000 0.244383 0.144410 -0.276676 0.160888 0.627929 -0.319980 -0.232866 0.197549 0.245345 -0.136951
V32 -0.633279 -0.115820 -0.367431 0.383456 0.619779 -0.292468 -0.458844 0.471798 -0.369241 -0.018839 -0.363715 0.008758 -0.102412 0.276829 -0.394717 0.289538 0.344394 0.287090 -0.147915 0.096048 0.142997 0.155642 0.633804 0.825119 -0.711082 0.011615 -0.765733 0.131296 -0.079502 0.506868 0.244383 1.000000 0.425631 -0.368878 0.252875 -0.047092 -0.076385 -0.273652 -0.390348 -0.207478 -0.032793
V33 -0.289917 0.203387 -0.219509 -0.052216 0.458888 0.587371 -0.267082 -0.251002 -0.137518 -0.110081 0.422481 -0.355644 -0.111937 -0.321509 -0.287010 0.014966 -0.307117 0.314099 -0.286116 -0.580161 0.054475 -0.122773 0.051773 0.359401 -0.735157 0.367059 -0.380365 -0.258229 0.596903 0.611668 0.144410 0.425631 1.000000 -0.605510 0.239364 -0.279235 -0.304743 0.337791 -0.247687 0.062731 -0.102548
V34 -0.019433 -0.281097 0.225753 0.297496 -0.607112 -0.401306 0.232444 0.054575 -0.102147 0.514227 -0.062297 0.237594 -0.112675 0.142082 0.364587 0.040945 0.070968 -0.602106 0.756188 0.503773 0.111245 -0.312228 -0.355795 -0.220454 0.373514 -0.460704 0.324211 0.561588 -0.251588 -0.280160 -0.276676 -0.368878 -0.605510 1.000000 0.043479 0.090631 -0.029484 0.053792 0.339965 -0.490010 0.153854
V35 0.142695 -0.054777 0.501409 0.340764 -0.341275 -0.317068 -0.438937 -0.167235 0.117301 0.346129 -0.169441 -0.194636 0.224130 -0.161742 -0.337384 -0.447736 -0.213708 -0.190625 0.553275 0.059147 -0.700400 0.227197 -0.275453 0.249089 -0.039913 0.207572 -0.029566 -0.104315 -0.155400 0.207840 0.160888 0.252875 0.239364 0.043479 1.000000 -0.065047 -0.623487 -0.124098 -0.096356 -0.623920 -0.145603
V36 -0.124949 -0.580675 0.438341 -0.557958 -0.045510 -0.247402 -0.280033 0.155913 -0.351657 0.561161 -0.427396 0.242611 0.422558 0.547594 -0.133556 0.080209 -0.374603 0.356522 -0.240156 0.143676 -0.102281 -0.091865 -0.136416 -0.210090 0.392944 -0.149467 0.424834 -0.480046 -0.485512 -0.405508 0.627929 -0.047092 -0.279235 0.090631 -0.065047 1.000000 0.237905 -0.485314 0.751734 0.100848 -0.216453
V37 -0.350610 -0.437349 -0.502482 -0.356650 0.064515 -0.067280 -0.007940 0.522797 -0.200198 -0.403067 -0.045318 0.548449 0.168500 0.422413 -0.084773 0.430631 0.348304 0.479581 -0.505795 0.426708 0.386868 0.145478 0.571951 -0.243950 -0.029111 -0.559232 -0.148273 0.015925 0.147663 0.231476 -0.319980 -0.076385 -0.304743 -0.029484 -0.623487 0.237905 1.000000 -0.407308 0.119262 0.472608 -0.004769
V38 0.148316 0.655368 -0.073865 0.090986 0.171836 0.628722 0.469057 -0.614964 0.038099 -0.007520 0.471418 -0.206345 -0.559274 -0.762684 0.347191 -0.235650 -0.131721 -0.189250 0.266142 -0.646999 0.155868 -0.158051 -0.341177 0.168662 -0.194906 0.376878 0.051612 0.298395 0.334013 -0.090686 -0.232866 -0.273652 0.337791 0.053792 -0.124098 -0.485314 -0.407308 1.000000 -0.048431 0.024597 0.003584
V39 -0.120898 -0.350539 0.527742 -0.389080 -0.217778 -0.025458 -0.236700 -0.344376 0.001001 0.560471 -0.239641 0.053823 0.486516 0.184663 -0.238364 -0.287419 -0.511091 0.225480 0.031867 0.079338 -0.242004 -0.173739 -0.478436 -0.401241 0.370801 0.018593 0.542822 -0.335742 -0.427830 -0.508283 0.197549 -0.390348 -0.247687 0.339965 -0.096356 0.751734 0.119262 -0.048431 1.000000 -0.191764 -0.227264
V40 0.212632 0.155617 -0.306190 -0.665310 0.335332 0.423882 0.479354 0.256984 -0.308607 -0.474803 0.336767 0.307470 -0.258869 -0.053814 0.438196 0.477094 0.038264 0.286388 -0.699379 -0.412649 0.470061 -0.093564 0.268313 -0.197865 0.078735 -0.002513 0.067686 -0.326313 0.455882 0.012973 0.245345 -0.207478 0.062731 -0.490010 -0.623920 0.100848 0.472608 0.024597 -0.191764 1.000000 0.007802
Target 0.073307 -0.000946 -0.213855 0.110786 -0.100525 0.000237 0.236907 0.135996 0.008124 -0.051263 0.196715 -0.021807 -0.139718 0.117586 0.249118 0.230507 0.085314 -0.293340 0.053897 0.070803 0.256411 -0.134727 0.071042 -0.091242 -0.001440 -0.180469 0.014891 0.207359 0.108342 0.038867 -0.136951 -0.032793 -0.102548 0.153854 -0.145603 -0.216453 -0.004769 0.003584 -0.227264 0.007802 1.000000
[Figure omitted: correlation heatmap of V1–V40 and Target]
Strong Correlations:
     Variable1 Variable2 Correlation
3          V1        V5        True
22         V1       V24        True
23         V1       V25        True
25         V1       V27        True
30         V1       V32        True
51         V2       V14        True
57         V2       V20        True
63         V2       V26        True
73         V2       V36        True
75         V2       V38        True
91         V3       V16        True
96         V3       V21        True
98         V3       V23        True
100        V3       V25        True
105        V3       V30        True
110        V3       V35        True
112        V3       V37        True
114        V3       V39        True
129        V4       V17        True
131        V4       V19        True
136        V4       V24        True
139        V4       V27        True
140        V4       V28        True
148        V4       V36        True
152        V4       V40        True
167        V5       V19        True
172        V5       V24        True
173        V5       V25        True
175        V5       V27        True
180        V5       V32        True
182        V5       V34        True
191        V6        V8        True
194        V6       V11        True
203        V6       V20        True
212        V6       V29        True
216        V6       V33        True
221        V6       V38        True
228        V7       V11        True
232        V7       V15        True
235        V7       V18        True
259        V8        V9        True
262        V8       V12        True
264        V8       V14        True
266        V8       V16        True
267        V8       V17        True
273        V8       V23        True
276        V8       V26        True
287        V8       V37        True
288        V8       V38        True
294        V9       V12        True
298        V9       V16        True
303        V9       V21        True
330       V10       V17        True
340       V10       V27        True
347       V10       V34        True
349       V10       V36        True
352       V10       V39        True
363       V11       V20        True
372       V11       V29        True
388       V12       V16        True
409       V12       V37        True
415       V13       V15        True
421       V13       V21        True
428       V13       V28        True
438       V13       V38        True
447       V14       V20        True
453       V14       V26        True
463       V14       V36        True
465       V14       V38        True
471       V15       V18        True
474       V15       V21        True
499       V16       V21        True
501       V16       V23        True
522       V17       V20        True
525       V17       V23        True
529       V17       V27        True
530       V17       V28        True
541       V17       V39        True
544       V18       V19        True
559       V18       V34        True
575       V19       V28        True
581       V19       V34        True
582       V19       V35        True
584       V19       V37        True
587       V19       V40        True
594       V20       V26        True
601       V20       V33        True
602       V20       V34        True
606       V20       V38        True
610       V21       V22        True
623       V21       V35        True
650       V23       V25        True
652       V23       V27        True
655       V23       V30        True
657       V23       V32        True
662       V23       V37        True
667       V24       V25        True
669       V24       V27        True
674       V24       V32        True
685       V25       V27        True
688       V25       V30        True
690       V25       V32        True
691       V25       V33        True
710       V26       V37        True
717       V27       V30        True
719       V27       V32        True
726       V27       V39        True
731       V28       V31        True
734       V28       V34        True
742       V29       V30        True
745       V29       V33        True
755       V30       V32        True
756       V30       V33        True
762       V30       V39        True
769       V31       V36        True
784       V33       V34        True
800       V35       V37        True
803       V35       V40        True
807       V36       V39        True
  • The 40 input variables have complex interactions with one another, but none of them shows a strong positive or negative correlation with the Target variable.
  • The variable pairs that are strongly correlated (positively or negatively) are listed above.
  • Since no input variable correlates clearly with the output variable, this concludes the bivariate analysis; going further would add no additional insight.
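The pair-extraction step above (upper-triangle masking so each pair is counted once, then filtering on an absolute threshold) can be made concrete on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy frame: a and b are perfectly correlated, c is only weakly related to both.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})

corr = df.corr().abs()
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong = upper.stack()
strong = strong[strong > 0.5]  # same idea as the THRESHOLD filter above
```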

Data Preprocessing¶

Train Validation Split¶

In [15]:
# defining the dependent and independent variables
X = train_data_copy.drop(["Target"], axis=1)
y = train_data_copy["Target"]

# Splitting the data into training and validation sets. We need to use stratify to maintain the same distribution of the target variable in both sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=VS,random_state=RS, stratify=y, shuffle=True)

X_test = test_data_copy.drop(["Target"], axis=1)
y_test = test_data_copy["Target"]

# Check that the percentage of the output variable is the same in both train and validation datasets
train_percentage = y_train.value_counts(normalize=True)
val_percentage = y_val.value_counts(normalize=True)

print("Train Data Output Percentage:\n\n", train_percentage)
print("Validation Data Output Percentage:\n", val_percentage)

# Print the shape of the training and validation sets
print("Training Data Shape:", X_train.shape, y_train.shape)
print("Validation Data Shape:", X_val.shape, y_val.shape)
Train Data Output Percentage:

 Target
0    0.944467
1    0.055533
Name: proportion, dtype: float64
Validation Data Output Percentage:
 Target
0    0.9446
1    0.0554
Name: proportion, dtype: float64
Training Data Shape: (15000, 40) (15000,)
Validation Data Shape: (5000, 40) (5000,)

Initial Data Set Treatment¶

In [16]:
#Convert the Target variable to float64 on the training, validation, and test sets for consistency
y_train = y_train.astype(float)
y_val = y_val.astype(float)
y_test = y_test.astype(float)

# Impute the missing values using the median of each column. The median is used to avoid the influence of outliers,
# and SimpleImputer handles missing values systematically across multiple columns.
# Fit the imputer on the training set only, then apply the fitted transform to the
# validation and test sets, so that no information from those sets leaks into the medians.
imputer = SimpleImputer(strategy="median")

X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_val.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

#Confirm that there are no missing values in the training, validation, and test sets
print("Missing Values in Training Set:", X_train.isnull().sum().sum())
print("Missing Values in Validation Set:", X_val.isnull().sum().sum())
print("Missing Values in Test Set:", X_test.isnull().sum().sum())
Missing Values in Training Set: 0
Missing Values in Validation Set: 0
Missing Values in Test Set: 0

Model Building¶

Model Evaluation Metrics and Plotting Functions¶

  • To save time and reduce code duplication, I'm predefining methods that will be used frequently in the data modeling steps. The methods cover:
    • Performance metric calculations
    • Model performance plots
In [17]:
# Method to calculate the difference in performance metrics between training and validation results
# Parameters:
# model_training_metrics_list (DataFrame): Training metrics, one row per model.
# model_validation_metrics_list (DataFrame): Validation metrics, one row per model.
# Returns:
# A DataFrame containing the absolute difference in performance metrics, one row per model.
def performance_metrics_difference(model_training_metrics_list, model_validation_metrics_list):
     #Initialize empty DataFrame to store the difference in performance metrics
    all_differences_df = pd.DataFrame()
    
    #Loop through model_training_metrics_list and model_validation_metrics_list indices and calculate the difference in performance metrics
    for index in range(len(model_training_metrics_list)):
        model_training_metrics_df = model_training_metrics_list.iloc[index]
        model_validation_metrics_df = model_validation_metrics_list.iloc[index]
        
        # Calculate the absolute difference in performance metrics
        difference_df = pd.DataFrame({
            'Model': "Model " + str(index),
            'Loss Difference': abs(model_training_metrics_df['Loss'] - model_validation_metrics_df['Loss']),
            'F1 Score Difference': abs(model_training_metrics_df['F1 Score'] - model_validation_metrics_df['F1 Score']),
            'Accuracy Score Difference': abs(model_training_metrics_df['Accuracy Score'] - model_validation_metrics_df['Accuracy Score']),
            'Recall Score Difference': abs(model_training_metrics_df['Recall Score'] - model_validation_metrics_df['Recall Score']),
            'Precision Score Difference': abs(model_training_metrics_df['Precision Score'] - model_validation_metrics_df['Precision Score'])
        }, index=[0])

        # Append the difference DataFrame to the all_differences_df
        all_differences_df = pd.concat([all_differences_df, difference_df], ignore_index=True)

    # Return the DataFrame containing the differences in performance metrics
    return all_differences_df

def plot_loss_accuracy(history, name):
    """
    Function to plot loss/accuracy

    history: an object which stores the metrics and losses.
    name: can be one of Loss or Accuracy
    """
    fig, ax = plt.subplots() #Creating a subplot with figure and axes.
    plt.plot(history.history[name]) #Plotting the train accuracy or train loss
    plt.plot(history.history['val_'+name]) #Plotting the validation accuracy or validation loss

    plt.title('Model ' + name.capitalize()) #Defining the title of the plot.
    plt.ylabel(name.capitalize()) #Capitalizing the first letter.
    plt.xlabel('Epoch') #Defining the label for the x-axis.
    fig.legend(['Train', 'Validation'], loc="outside right upper") #Defining the legend, loc controls the position of the legend.
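For intuition about what `performance_metrics_difference` reports, the same absolute train/validation gaps can be reproduced directly on a pair of single-row metric frames (made-up values, matching the notebook's metric-DataFrame column layout):

```python
import pandas as pd

# Hypothetical metric rows in the column layout used throughout the notebook
train_metrics = pd.DataFrame({
    "Model": ["Model 0 Training"], "Loss": [0.09], "F1 Score": [0.85],
    "Accuracy Score": [0.98], "Recall Score": [0.91], "Precision Score": [0.79],
})
val_metrics = pd.DataFrame({
    "Model": ["Model 0 Validation"], "Loss": [0.11], "F1 Score": [0.79],
    "Accuracy Score": [0.97], "Recall Score": [0.90], "Precision Score": [0.70],
})

# Same computation the helper performs row by row: elementwise absolute difference
gaps = (train_metrics.drop(columns="Model")
        - val_metrics.drop(columns="Model")).abs().round(6)
print(gaps)  # Loss 0.02, F1 0.06, Accuracy 0.01, Recall 0.01, Precision 0.09
```

A large gap in any column flags train/validation divergence, i.e. potential overfitting.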
In [18]:
from sklearn.utils import class_weight

# Function to calculate performance metrics of a model
# Parameters:
# model: The trained model to evaluate.
# X: Features used for prediction.
# y: True labels for the features.
# model_name: Name of the model for identification in the output DataFrame.
# Returns:
# A DataFrame containing the performance metrics of the model, including F1 score, accuracy, recall, and precision.
def performance_metrics(model, X, y, model_name="Default"):
    # Flatten to a 1-D array of 0/1 labels so the sklearn metrics receive the expected shape
    y_pred = (model.predict(X) > THRESHOLD).astype(int).flatten()

    f1 = f1_score(y, y_pred)
    accuracy = accuracy_score(y, y_pred)
    recall = recall_score(y, y_pred)
    precision  = precision_score(y, y_pred)

    metrics_df = pd.DataFrame({
        'Model': [model_name],
        'F1 Score': [f1],
        'Accuracy Score': [accuracy],
        'Recall Score': [recall],
        'Precision Score': [precision]
    })

    return metrics_df

def get_binary_prediction_value(y_pred):
    return (y_pred > THRESHOLD).astype(int).flatten()

def evaluate_neural_network_on_recall(
      model,
      x,
      y,
      validation_data,
      epochs=EPOCHS,
      batch_size=BATCH_SIZE,
      model_name="Default",
      plot_loss_graph=False
):
    # Note: this always fits against the global training data with the given epochs, batch size,
    # and validation data, so each call trains the model for additional epochs.
    history = model.fit(X_train,
                        y_train,
                        epochs=epochs,
                        batch_size = batch_size,
                        validation_data=validation_data,
                        class_weight = CLASS_WEIGHTS,                
                        verbose=0)

    if plot_loss_graph:
        plot_loss_accuracy(history, 'loss')

    y_pred = get_binary_prediction_value(model.predict(x))

    # Calculate the loss of the model (the recall from evaluate() is recomputed below at the custom threshold)
    loss, _ = model.evaluate(x, y, verbose=0)

    f1 = f1_score(y, y_pred)
    accuracy = accuracy_score(y, y_pred)
    recall = recall_score(y, y_pred)
    precision  = precision_score(y, y_pred)

    # To make things cleaner create a DataFrame out of the classification report as a dictionary
    report = classification_report(y, y_pred, output_dict=True)
    report_df = pd.DataFrame(report).transpose()

    # Add a caption/title using Styler
    styled_report_df = report_df.style.set_caption(model_name + " Classification Report")
    
    metrics_df = pd.DataFrame({
        'Model': [model_name],
        'Loss': [loss],
        'F1 Score': [f1],
        'Accuracy Score': [accuracy],
        'Recall Score': [recall],
        'Precision Score': [precision]
    })
    
    #Return the metrics and report DataFrames
    return metrics_df, styled_report_df

def generate_model_reports(model, model_name="Default"):
    # clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
    tf.keras.backend.clear_session()

    model_training_metrics_df, model_training_report_df  = evaluate_neural_network_on_recall(
        model = model,
        x = X_train,
        y = y_train,
        validation_data=(X_val, y_val),
        model_name=model_name + " Training",
        plot_loss_graph=True
    )

    model_validation_metrics_df, model_validation_report_df = evaluate_neural_network_on_recall(
        model = model,
        x = X_val,
        y = y_val,
        validation_data=(X_val, y_val),
        model_name= model_name + " Validation",
    )

    model.summary()

    display(model_training_metrics_df)
    display(model_validation_metrics_df)

    display(model_training_report_df)
    display(model_validation_report_df)
    
    return model_training_metrics_df, model_validation_metrics_df, model_training_report_df, model_validation_report_df   

Model Evaluation Criterion¶

  • Since the largest expense to the company is a missed failure (a false negative), which leads to a costly generator replacement, the model should focus on maximizing detection of actual failures (true positives). The performance metric the models will therefore optimize is recall.
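The `CLASS_WEIGHTS` constant passed to `model.fit` in the helper above supports this recall focus by up-weighting the rare failure class. It is defined outside this section; one common way to derive such weights is scikit-learn's `compute_class_weight` (a sketch on toy labels mirroring the ~5.5% failure rate, not the notebook's actual definition):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 945 non-failures, 55 failures (~5.5% positives, as in the training set)
y_toy = np.array([0] * 945 + [1] * 55)

# "balanced" assigns each class the weight n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
class_weights = dict(zip([0, 1], weights))
print(class_weights)  # failures get roughly 17x the weight of non-failures
```

During training, each failure example then contributes about 17 times as much to the loss, pushing the network toward higher recall on the minority class.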

Initial Model Building (Model 0)¶

  • Let's start with a neural network consisting of
    • just one hidden layer
    • activation function of ReLU
    • SGD as the optimizer
In [19]:
# Define the optimizer to be used for training the model
# Using SGD (Stochastic Gradient Descent) optimizer with default parameters
# Note: You can also use other optimizers like Adam, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.SGD()

model0 = Sequential()
model0.add(Dense(20, activation='relu', input_dim=X_train.shape[1]))
# Sigmoid output layer for binary classification
model0.add(Dense(1, activation='sigmoid'))  
model0.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model0_training_metrics_df, model0_validation_metrics_df, model0_training_report_df, model0_validation_report_df = generate_model_reports(
    model=model0,
    model_name="Model 0"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 876us/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 921us/step
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │            21 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 843 (3.30 KB)
 Trainable params: 841 (3.29 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 2 (12.00 B)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 0 Training 0.0908 0.846585 0.981733 0.907563 0.793284
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 0 Validation 0.111384 0.787402 0.973 0.902527 0.698324
Model 0 Training Classification Report
  precision recall f1-score support
0.0 0.994518 0.986094 0.990289 14167.000000
1.0 0.793284 0.907563 0.846585 833.000000
accuracy 0.981733 0.981733 0.981733 0.981733
macro avg 0.893901 0.946829 0.918437 15000.000000
weighted avg 0.983343 0.981733 0.982308 15000.000000
Model 0 Validation Classification Report
  precision recall f1-score support
0.0 0.994184 0.977133 0.985585 4723.000000
1.0 0.698324 0.902527 0.787402 277.000000
accuracy 0.973000 0.973000 0.973000 0.973000
macro avg 0.846254 0.939830 0.886493 5000.000000
weighted avg 0.977793 0.973000 0.974605 5000.000000
[Loss plot for Model 0: training vs. validation loss per epoch]
  • On the initial model setup we achieve recall scores of 0.908 and 0.903 on the training and validation sets respectively, which is a good start.
  • The model is not overfit at this point.

Model Performance Improvement¶

Model 1¶

In [20]:
# Define the optimizer to be used for training the model
# Using SGD (Stochastic Gradient Descent) optimizer with default parameters
# Note: You can also use other optimizers like Adam, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.SGD()

# Define a more complex model with additional layers
# This model has two hidden layers with 40 and 20 neurons respectively, using ReLU activation functions.
# The output layer uses a sigmoid activation function for binary classification.
model1 = Sequential()
model1.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model1.add(Dense(20, activation='relu'))
# Sigmoid output layer for binary classification
model1.add(Dense(1, activation='sigmoid'))  
model1.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model1_training_metrics_df, model1_validation_metrics_df, model1_training_report_df, model1_validation_report_df = generate_model_reports(
    model=model1,
    model_name="Model 1"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 0s 928us/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 983us/step
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 40)             │         1,640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 1)              │            21 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 2,483 (9.70 KB)
 Trainable params: 2,481 (9.69 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 2 (12.00 B)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 1 Training 0.038931 0.936937 0.993 0.936375 0.9375
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 1 Validation 0.119307 0.797428 0.9748 0.895307 0.718841
Model 1 Training Classification Report
  precision recall f1-score support
0.0 0.996259 0.996329 0.996294 14167.000000
1.0 0.937500 0.936375 0.936937 833.000000
accuracy 0.993000 0.993000 0.993000 0.993000
macro avg 0.966880 0.966352 0.966616 15000.000000
weighted avg 0.992996 0.993000 0.992998 15000.000000
Model 1 Validation Classification Report
  precision recall f1-score support
0.0 0.993770 0.979462 0.986564 4723.000000
1.0 0.718841 0.895307 0.797428 277.000000
accuracy 0.974800 0.974800 0.974800 0.974800
macro avg 0.856305 0.937385 0.891996 5000.000000
weighted avg 0.978539 0.974800 0.976086 5000.000000
[Loss plot for Model 1: training vs. validation loss per epoch]
  • The model achieves recall scores of 0.936 and 0.895 on the training and validation sets respectively, so adding a second hidden layer improves recall on the training set only; validation recall dips slightly.
  • The train/validation gap in loss and precision is wider than for Model 0, but the model is not badly overfit at this point.
  • The loss curves converge at around 13 to 15 epochs.

Model 2¶

In [21]:
# Define the optimizer to be used for training the model
# Using SGD (Stochastic Gradient Descent) optimizer with default parameters
# Note: You can also use other optimizers like Adam, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.SGD()

# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The output layer uses a sigmoid activation function for binary classification.
model2 = Sequential()
model2.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model2.add(Dense(20, activation='relu'))
model2.add(Dense(10, activation='relu'))
# Sigmoid output layer for binary classification
model2.add(Dense(1, activation='sigmoid'))  
model2.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model2_training_metrics_df, model2_validation_metrics_df, model2_training_report_df, model2_validation_report_df = generate_model_reports(
    model=model2,
    model_name="Model 2"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 983us/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 960us/step
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 40)             │         1,640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           210 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 2,683 (10.48 KB)
 Trainable params: 2,681 (10.47 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 2 (12.00 B)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 2 Training 0.069391 0.882751 0.986133 0.939976 0.832094
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 2 Validation 0.090182 0.847059 0.9818 0.909747 0.792453
Model 2 Training Classification Report
  precision recall f1-score support
0.0 0.996444 0.988847 0.992631 14167.000000
1.0 0.832094 0.939976 0.882751 833.000000
accuracy 0.986133 0.986133 0.986133 0.986133
macro avg 0.914269 0.964412 0.937691 15000.000000
weighted avg 0.987317 0.986133 0.986529 15000.000000
Model 2 Validation Classification Report
  precision recall f1-score support
0.0 0.994660 0.986026 0.990324 4723.000000
1.0 0.792453 0.909747 0.847059 277.000000
accuracy 0.981800 0.981800 0.981800 0.981800
macro avg 0.893557 0.947887 0.918692 5000.000000
weighted avg 0.983458 0.981800 0.982387 5000.000000
[Loss plot for Model 2: training vs. validation loss per epoch]
  • The model achieves recall scores of 0.940 and 0.910 on the training and validation sets respectively. Training recall is about the same as the previous model, while validation recall and precision improve modestly.
  • The model is not overfit at this point, and the training and validation recall scores are close to each other. The added third hidden layer brings only a modest improvement.

Model 3¶

In [22]:
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The output layer uses a sigmoid activation function for binary classification.
model3 = Sequential()
model3.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model3.add(Dense(20, activation='relu'))
model3.add(Dense(10, activation='relu'))
# Sigmoid output layer for binary classification
model3.add(Dense(1, activation='sigmoid'))  
model3.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model3_training_metrics_df, model3_validation_metrics_df, model3_training_report_df, model3_validation_report_df = generate_model_reports(
    model=model3,
    model_name="Model 3"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 962us/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 963us/step
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 40)             │         1,640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           210 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 8,045 (31.43 KB)
 Trainable params: 2,681 (10.47 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 5,364 (20.96 KB)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 3 Training 0.048698 0.934211 0.992667 0.937575 0.93087
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 3 Validation 0.107645 0.856164 0.9832 0.902527 0.814332
Model 3 Training Classification Report
  precision recall f1-score support
0.0 0.996328 0.995906 0.996117 14167.000000
1.0 0.930870 0.937575 0.934211 833.000000
accuracy 0.992667 0.992667 0.992667 0.992667
macro avg 0.963599 0.966741 0.965164 15000.000000
weighted avg 0.992693 0.992667 0.992679 15000.000000
Model 3 Validation Classification Report
  precision recall f1-score support
0.0 0.994247 0.987931 0.991079 4723.000000
1.0 0.814332 0.902527 0.856164 277.000000
accuracy 0.983200 0.983200 0.983200 0.983200
macro avg 0.904289 0.945229 0.923622 5000.000000
weighted avg 0.984279 0.983200 0.983605 5000.000000
[Loss plot for Model 3: training vs. validation loss per epoch]
  • The model achieves recall scores of 0.938 and 0.903 on the training and validation sets respectively, roughly level with the previous model on recall while validation precision and F1 improve.
  • Switching from SGD to the Adam optimizer (the architecture is unchanged from Model 2) improves the precision/recall balance slightly.

Model 4¶

In [23]:
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The second layer has a batch normalization layer added to improve training stability and performance.
# The output layer uses a sigmoid activation function for binary classification.
model4 = Sequential()
model4.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model4.add(Dense(20, activation='relu'))
# Add batch normalization layer after the 2nd layer
model4.add(BatchNormalization())
model4.add(Dense(10, activation='relu'))
# Sigmoid output layer for binary classification
model4.add(Dense(1, activation='sigmoid'))  
model4.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model4_training_metrics_df, model4_validation_metrics_df, model4_training_report_df, model4_validation_report_df = generate_model_reports(
    model=model4,
    model_name="Model 4"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 40)             │         1,640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_normalization             │ (None, 20)             │            80 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           210 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 8,205 (32.05 KB)
 Trainable params: 2,721 (10.63 KB)
 Non-trainable params: 40 (160.00 B)
 Optimizer params: 5,444 (21.27 KB)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 4 Training 0.050145 0.925391 0.991733 0.923169 0.927624
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 4 Validation 0.075911 0.870466 0.985 0.909747 0.834437
Model 4 Training Classification Report
  precision recall f1-score support
0.0 0.995484 0.995765 0.995624 14167.000000
1.0 0.927624 0.923169 0.925391 833.000000
accuracy 0.991733 0.991733 0.991733 0.991733
macro avg 0.961554 0.959467 0.960508 15000.000000
weighted avg 0.991715 0.991733 0.991724 15000.000000
Model 4 Validation Classification Report
  precision recall f1-score support
0.0 0.994679 0.989414 0.992039 4723.000000
1.0 0.834437 0.909747 0.870466 277.000000
accuracy 0.985000 0.985000 0.985000 0.985000
macro avg 0.914558 0.949580 0.931253 5000.000000
weighted avg 0.985801 0.985000 0.985304 5000.000000
[Loss plot for Model 4: training vs. validation loss per epoch]
  • The model achieves recall scores of 0.923 and 0.910 on the training and validation sets respectively: training recall drops slightly versus the previous model while validation recall edges up.
  • Adding Batch Normalization after the second hidden layer does not meaningfully improve overall performance.

Model 5¶

In [24]:
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The second layer has a dropout layer added to prevent overfitting.
# The third layer has a batch normalization layer added to improve training stability and performance.
# The output layer uses a sigmoid activation function for binary classification.
# Note: The dropout rate is set to 0.3, which means 30% of the neurons will be randomly dropped during training.
model5 = Sequential()
model5.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model5.add(Dense(20, activation='relu'))
# Add dropout layer after the 2nd layer to prevent overfitting
model5.add(Dropout(0.3))
model5.add(Dense(10, activation='relu'))
# Add batch normalization layer after the 3rd layer
model5.add(BatchNormalization())
# Sigmoid output layer for binary classification
model5.add(Dense(1, activation='sigmoid'))  
model5.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model5_training_metrics_df, model5_validation_metrics_df, model5_training_report_df, model5_validation_report_df = generate_model_reports(
    model=model5,
    model_name="Model 5"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 40)             │         1,640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 20)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           210 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_normalization             │ (None, 10)             │            40 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 8,125 (31.74 KB)
 Trainable params: 2,701 (10.55 KB)
 Non-trainable params: 20 (80.00 B)
 Optimizer params: 5,404 (21.11 KB)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 5 Training 0.068797 0.91432 0.990467 0.915966 0.912679
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 5 Validation 0.098513 0.918033 0.991 0.909747 0.926471
Model 5 Training Classification Report
  precision recall f1-score support
0.0 0.995058 0.994847 0.994953 14167.000000
1.0 0.912679 0.915966 0.914320 833.000000
accuracy 0.990467 0.990467 0.990467 0.990467
macro avg 0.953869 0.955407 0.954636 15000.000000
weighted avg 0.990483 0.990467 0.990475 15000.000000
Model 5 Validation Classification Report
  precision recall f1-score support
0.0 0.994712 0.995765 0.995239 4723.000000
1.0 0.926471 0.909747 0.918033 277.000000
accuracy 0.991000 0.991000 0.991000 0.991000
macro avg 0.960591 0.952756 0.956636 5000.000000
weighted avg 0.990932 0.991000 0.990961 5000.000000
[Loss plot for Model 5: training vs. validation loss per epoch]
  • The model achieves recall scores of 0.916 and 0.910 on the training and validation sets respectively, about the same as the previous model (training is marginally worse), with the train/validation gap essentially closed.

Model 6¶

In [25]:
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Define a more complex model with additional layers
# This model has four hidden layers with 40, 20, 10, and 5 neurons respectively, using ReLU activation functions.
# The second and fourth layers have dropout layers added to prevent overfitting.
# The third layer has a batch normalization layer added to improve training stability and performance.
# The output layer uses a sigmoid activation function for binary classification.
# Note: The dropout rate is set to 0.3, which means 30% of the neurons will be randomly dropped during training.
model6 = Sequential()
model6.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model6.add(Dense(20, activation='relu'))
# Add dropout layer after the 2nd layer to prevent overfitting
model6.add(Dropout(0.3))
model6.add(Dense(10, activation='relu'))
# Add batch normalization layer after the 3rd layer
model6.add(BatchNormalization())
model6.add(Dense(5, activation='relu'))
# Add dropout layer after the 4th layer to prevent overfitting
model6.add(Dropout(0.3))
# Sigmoid output layer for binary classification
model6.add(Dense(1, activation='sigmoid'))
model6.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['Recall'])

# Generate the model reports
model6_training_metrics_df, model6_validation_metrics_df, model6_training_report_df, model6_validation_report_df = generate_model_reports(
    model=model6,
    model_name="Model 6"
)
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 40)             │         1,640 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │           820 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 20)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 10)             │           210 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_normalization             │ (None, 10)             │            40 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 5)              │            55 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 5)              │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 1)              │             6 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 8,275 (32.33 KB)
 Trainable params: 2,751 (10.75 KB)
 Non-trainable params: 20 (80.00 B)
 Optimizer params: 5,504 (21.50 KB)
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 6 Training 0.079933 0.919374 0.991067 0.917167 0.921592
Model Loss F1 Score Accuracy Score Recall Score Precision Score
0 Model 6 Validation 0.073594 0.904505 0.9894 0.906137 0.902878
Model 6 Training Classification Report
  precision recall f1-score support
0.0 0.995131 0.995412 0.995271 14167.000000
1.0 0.921592 0.917167 0.919374 833.000000
accuracy 0.991067 0.991067 0.991067 0.991067
macro avg 0.958362 0.956289 0.957323 15000.000000
weighted avg 0.991047 0.991067 0.991057 15000.000000
Model 6 Validation Classification Report
  precision recall f1-score support
0.0 0.994494 0.994283 0.994389 4723.000000
1.0 0.902878 0.906137 0.904505 277.000000
accuracy 0.989400 0.989400 0.989400 0.989400
macro avg 0.948686 0.950210 0.949447 5000.000000
weighted avg 0.989418 0.989400 0.989409 5000.000000
  • The model achieves recall scores of 0.917 on the training set and 0.906 on the validation set, performing about the same as the previous model on both.
  • The additional dropout layers, hidden layers, and batch normalization layers make the model more complex without adding any value.
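The parameter counts in the summary above can be verified by hand: a Dense layer with n_in inputs and n_out units has n_in·n_out weights plus n_out biases, and BatchNormalization contributes 4 parameters per feature (gamma and beta are trainable; the moving mean and variance are not). A quick arithmetic check:

```python
# Verify the trainable-parameter count from the Model 6 summary.
# Dense(n_out) on n_in inputs: n_in * n_out weights + n_out biases.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

params = [
    dense_params(40, 40),  # dense     -> 1,640
    dense_params(40, 20),  # dense_1   ->   820
    dense_params(20, 10),  # dense_2   ->   210
    2 * 10,                # batch_normalization: gamma + beta (trainable)
    dense_params(10, 5),   # dense_3   ->    55
    dense_params(5, 1),    # dense_4   ->     6
]
trainable = sum(params)
non_trainable = 2 * 10     # batch norm moving mean + moving variance
print(trainable, non_trainable)  # 2751 20
```

These match the "Trainable params: 2,751" and "Non-trainable params: 20" lines reported by Keras above.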

Model Performance Comparison and Final Model Selection¶

To select the final model, we will compare the performance of all the models on the training and validation sets.

In [26]:
# Collect all of the training model metrics DataFrames into a single DataFrame
model_training_metrics_list = pd.concat([
    model0_training_metrics_df,
    model1_training_metrics_df,
    model2_training_metrics_df,
    model3_training_metrics_df,
    model4_training_metrics_df,
    model5_training_metrics_df,
    model6_training_metrics_df
], ignore_index=True)

# Collect all of the validation model metrics DataFrames into a single DataFrame
model_validation_metrics_list = pd.concat([
    model0_validation_metrics_df,
    model1_validation_metrics_df,
    model2_validation_metrics_df,
    model3_validation_metrics_df,
    model4_validation_metrics_df,
    model5_validation_metrics_df,
    model6_validation_metrics_df
], ignore_index=True)

# Calculate the difference in performance metrics between the training and validation metrics DataFrames
model_metrics_difference_df = performance_metrics_difference(
    model_training_metrics_list,
    model_validation_metrics_list
)

# Sort the model metrics difference DataFrame by Recall Score Difference in ascending order
model_metrics_difference_df = model_metrics_difference_df.sort_values(by='Recall Score Difference')

display(model_training_metrics_list.T)
display(model_validation_metrics_list.T)
display(model_metrics_difference_df.T)

# Display the model with the smallest recall-score difference between the training and validation sets
display(model_metrics_difference_df.iloc[0].T)
0 1 2 3 4 5 6
Model Model 0 Training Model 1 Training Model 2 Training Model 3 Training Model 4 Training Model 5 Training Model 6 Training
Loss 0.0908 0.038931 0.069391 0.048698 0.050145 0.068797 0.079933
F1 Score 0.846585 0.936937 0.882751 0.934211 0.925391 0.91432 0.919374
Accuracy Score 0.981733 0.993 0.986133 0.992667 0.991733 0.990467 0.991067
Recall Score 0.907563 0.936375 0.939976 0.937575 0.923169 0.915966 0.917167
Precision Score 0.793284 0.9375 0.832094 0.93087 0.927624 0.912679 0.921592
0 1 2 3 4 5 6
Model Model 0 Validation Model 1 Validation Model 2 Validation Model 3 Validation Model 4 Validation Model 5 Validation Model 6 Validation
Loss 0.111384 0.119307 0.090182 0.107645 0.075911 0.098513 0.073594
F1 Score 0.787402 0.797428 0.847059 0.856164 0.870466 0.918033 0.904505
Accuracy Score 0.973 0.9748 0.9818 0.9832 0.985 0.991 0.9894
Recall Score 0.902527 0.895307 0.909747 0.902527 0.909747 0.909747 0.906137
Precision Score 0.698324 0.718841 0.792453 0.814332 0.834437 0.926471 0.902878
0 5 6 4 2 3 1
Model Model 0 Model 5 Model 6 Model 4 Model 2 Model 3 Model 1
Loss Difference 0.020584 0.029716 0.006339 0.025766 0.02079 0.058947 0.080376
F1 Score Difference 0.059183 0.003713 0.01487 0.054925 0.035692 0.078046 0.139509
Accuracy Score Difference 0.008733 0.000533 0.001667 0.006733 0.004333 0.009467 0.0182
Recall Score Difference 0.005036 0.006219 0.01103 0.013422 0.030229 0.035048 0.041068
Precision Score Difference 0.09496 0.013791 0.018715 0.093187 0.039641 0.116538 0.218659
Model                          Model 0
Loss Difference               0.020584
F1 Score Difference           0.059183
Accuracy Score Difference     0.008733
Recall Score Difference       0.005036
Precision Score Difference     0.09496
Name: 0, dtype: object
  • Comparing recall scores between the training and validation sets, Model 0 shows the smallest difference (≈0.005), indicating the least overfitting, while its validation recall remains competitive with the other models. This is the model we will evaluate against the test set.
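The `performance_metrics_difference` helper used above is defined earlier in the notebook; as a rough sketch of what it computes (column names taken from the tables above, implementation details assumed), it takes the absolute per-metric gap between matching training and validation rows:

```python
import pandas as pd

def performance_metrics_difference(train_df, valid_df):
    """Sketch: absolute per-metric gap between training and validation rows.
    Assumes both DataFrames share row order and these metric columns."""
    metrics = ["Loss", "F1 Score", "Accuracy Score", "Recall Score", "Precision Score"]
    diff = (train_df[metrics] - valid_df[metrics]).abs()
    diff.columns = [f"{m} Difference" for m in metrics]
    # Strip the " Training" suffix to recover the base model name
    diff.insert(0, "Model", train_df["Model"].str.replace(" Training", "", regex=False))
    return diff

# Toy example using Model 0's reported numbers
train = pd.DataFrame({"Model": ["Model 0 Training"], "Loss": [0.0908], "F1 Score": [0.8466],
                      "Accuracy Score": [0.9817], "Recall Score": [0.9076], "Precision Score": [0.7933]})
valid = pd.DataFrame({"Model": ["Model 0 Validation"], "Loss": [0.1114], "F1 Score": [0.7874],
                      "Accuracy Score": [0.9730], "Recall Score": [0.9025], "Precision Score": [0.6983]})
print(performance_metrics_difference(train, valid).round(4))
```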

Now, let's check the performance of the final model on the test set.

In [30]:
# Calculate the performance metrics of the best model on the test set
best_model_test_perf = performance_metrics(model0, X_test, y_test, "Model 0 Test")

y_test_pred_best = model0.predict(X_test)

# Build a DataFrame from the classification-report dictionary for cleaner display
report = classification_report(y_test, y_test_pred_best > THRESHOLD, output_dict=True)
report_df = pd.DataFrame(report).transpose()

# Add a caption/title using Styler
styled_report_df = report_df.style.set_caption("Model 0 Test Set Classification Report")

display(best_model_test_perf)
display(styled_report_df)
draw_confusion_matrix(model0, X_test, y_test, "Model 0 Test Set Confusion Matrix")
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 997us/step
Model F1 Score Accuracy Score Recall Score Precision Score
0 Model 0 Test 0.767802 0.97 0.879433 0.681319
Model 0 Test Set Classification Report
  precision recall f1-score support
0.0 0.992666 0.975413 0.983964 4718.000000
1.0 0.681319 0.879433 0.767802 282.000000
accuracy 0.970000 0.970000 0.970000 0.970000
macro avg 0.836992 0.927423 0.875883 5000.000000
weighted avg 0.975106 0.970000 0.971773 5000.000000
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
[Figure: Model 0 Test Set Confusion Matrix]
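The confusion-matrix cell counts can be recovered arithmetically from the test metrics above (recall 0.8794, precision 0.6813, 282 actual failures out of 5,000 observations):

```python
# Recover confusion-matrix counts from the reported test metrics.
total, positives = 5000, 282          # support values from the report
recall, precision = 0.879433, 0.681319

tp = round(recall * positives)        # correctly predicted failures
fn = positives - tp                   # missed failures
fp = round(tp / precision) - tp       # false alarms (inspection cost only)
tn = total - positives - fp           # correctly predicted non-failures
accuracy = (tp + tn) / total

print(tp, fn, fp, tn, accuracy)       # 248 34 116 4602 0.97
```

The recovered accuracy of 0.97 matches the reported test accuracy, confirming the counts are consistent.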

Actionable Insights and Recommendations¶

  • The model accurately detects ~88% of the wind turbines that are about to fail. Based on the confusion matrix, it correctly predicts 248 of the 282 actual failures.
  • As stated previously, the primary metric for modeling was recall, since the biggest financial risk to the company is the wind turbine replacement cost, so staying ahead of failures is the safer bet.
  • In the worst case, if the model classifies a wind turbine as about to fail when it isn't, the cost to the company is minimal: an inspection fee rather than a replacement fee.
  • Since the input parameters are ciphered for confidentiality, the best we could do was indicate, through the bivariate analysis, which input parameters were highly correlated with each other. None of the inputs showed a strong correlation with the target; the business will know best what those exact parameters represent.
  • While the neural network does a great job of capturing complex interactions among the 40 input parameters, that same flexibility means it cannot report feature (input) importance the way other types of machine learning algorithms can.
  • One major challenge with the dataset is its heavy skew toward non-failures, which likely kept true-failure detection (recall) in the high-80% range. If possible, it would help to gather more failure scenarios so the model has a more balanced dataset to learn from. In the future, techniques such as SMOTE could also be used to synthetically create failure scenarios to fill out the dataset.
  • Lastly, it would be helpful to learn from the business whether there is a target prediction performance it would like this first-generation model to achieve.
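The SMOTE idea mentioned above generates new minority-class samples by interpolating between existing failure examples and their nearest minority-class neighbors. A minimal NumPy sketch of the core interpolation step (the production implementation lives in the imbalanced-learn package; function name and defaults here are illustrative):

```python
import numpy as np

def smote_sample(X_minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise distances within the minority class (self excluded)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))                       # pick a minority sample
        j = neighbors[i, rng.integers(min(k, len(X) - 1))]  # pick one of its neighbors
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Toy failure observations in a 2-D ciphered feature space
failures = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.4, 0.3]])
new_points = smote_sample(failures, n_new=6)
print(new_points.shape)  # (6, 2)
```

Every synthetic point lies on a segment between two real failure examples, so the augmented failure class stays within the region the sensors actually observed.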

Export the project to HTML¶

In [33]:
# Export the project to HTML

!jupyter nbconvert --to html "ErnestHolloway-INN_ReneWind_Main_Project_FullCode_Notebook-8-4-25.ipynb"
[NbConvertApp] Converting notebook ErnestHolloway-INN_ReneWind_Main_Project_FullCode_Notebook-8-4-25.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 50 image(s).
[NbConvertApp] Writing 4976155 bytes to ErnestHolloway-INN_ReneWind_Main_Project_FullCode_Notebook-8-4-25.html